ABSTRACT

Parallel computer architecture complicates the already
difficult task of parallel programming in many ways, e.g., by
a rigid interconnection structure, addressing complexity,
and shape and size mismatches. The CHiP computer is a new
architecture that reduces these complications by permitting
the processor interconnection structure to be programmed.

This new kind of programming is explained. Algorithms are
presented for several interconnection patterns, including the
torus and the complete binary tree, and general embedding
strategies are identified.
Introduction

Although it is a difficult task to design a sequential computer
architecture that efficiently hosts sequential algorithms, it is perhaps
even more challenging to design a parallel architecture that efficiently
hosts parallel algorithms. The aspects of parallel computation that
frustrate the harmonious match between algorithm and architecture are many:

Rigid interconnection structure: Parallel architectures tend to
provide a fixed interconnection structure between processing elements
(PE's). For example, ILLIAC IV is mesh connected; the Massively
Parallel Processor [1] has a toroidal structure. But recently
developed parallel algorithms use a variety of PE interconnection
structures. For example, there are tree algorithms for everything
from sorting to graph coloring [2] as well as applicative language
expression evaluation [3], hexagonally connected pipelined
algorithms for numeric problems [4], "double trees" for searching and
data base operations [5], and many nonstandard interconnection
graphs. (See Figure 1.) The problem is that the rigid interconnection
structure biases the architecture towards a particular class of
algorithms and makes it difficult to use for any other class of
algorithms.

*The research described herein is part of the Blue CHiP Project. Funding is provided in part
by the Office of Naval Research under Contract No. N00014-80-K-0818 and Contract No.
N00014-81-K-0380, Special Research Opportunities Program, Task SRO-100.

Problem  shape  and  size  mismatch:  Parallel  algorithms  tend  to 
require  a  particular  number  of  PE's  in  a  particular  shape  that  is 
determined  by  the  problem’s  input,  but  the  architecture  provides 
only  one  fixed  size  and  shape.  For  example,  an  algorithm  requiring 
an  n/2  x  2n  array  of  PE’s  does  not  "fit"  on  an  nxn  mesh  connected 
architecture  even  though  there  are  enough  processors. 

Addressing complexity: Certain parallel architectures, e.g., the Ultra
Computer [6] and the Cube connected cycles [7], provide a "universal"
interconnection structure in which a logical interconnection
structure is implemented on the physical structure by means of
packet routing operations. Time is wasted in unproductive packet
switching. More seriously, the programs stored in the PE's are
complicated by the need to compute target addresses.

Paucity of programming languages: Although languages such as APL
and Concurrent Pascal have "parallel semantics," most parallel
algorithms are specified in an ad hoc manner. Thus there is little
guidance from the programming language as to what features to
optimize for.

These and other complications explain in large measure why highly
parallel computers have been difficult to program.

We report on a new family of architectures, the Configurable, Highly
Parallel (CHiP) computers, that respond to the demands of parallel
algorithms, especially the need for locality and flexibility. The central
concept is this:

The  processing  elements  are  embedded  into  a  programmable  switch 
lattice  that  permits  not  only  the  programming  of  the  PE's  but  also 
the  direct  programming  of  their  interconnection  structure. 

This second kind of programming not only ameliorates the difficulties
mentioned above, it also permits the convenient composition of parallel
algorithms. It has even led to the development of entirely new parallel
algorithms [8]. In this paper we give a synopsis of the CHiP architecture
and then explore the consequences of this new kind of programming,
interconnection structure programming. The main results are
algorithms for programming various interconnection structures.

Synopsis  of  the  CHiP  Computer 

[Readers familiar with the CHiP Computer may wish to omit this
section.]

A CHiP Computer is composed of a set of homogeneous microprocessor
elements connected at regular intervals to the switches of the switch
lattice. The lattice is composed of programmable switches connected by
data paths to each other or to the PE's. Perimeter switches are attached
to external storage devices. Figure 2 illustrates two examples of this
structure.* Each PE has its own local program and data memory and
each switch contains enough local memory to store several configuration
settings.

*Notice that the pictures are not drawn to scale. The PE's are much larger than
the switches.

Figure 2. Two lattices, (a) and (b). Circles represent switches; squares
represent processing elements; lines represent data paths.

A  configuration  setting  is  an  instruction  which,  when  invoked,  causes 
the  switch  to  form  a  passive  connection  between  any  combination  of  its 
incident  data  paths.  Notice  that  this  is  circuit  switching  rather  than 
packet  switching  and  that  fan  out  is  possible  at  the  switches.  Figure  3(a) 
shows the configuration settings for a mesh pattern for the lattice of
Figure 2(a); Figure 3(b) shows the same lattice configured as a binary tree.
To  implement  an  interconnection  pattern,  the  switches  are  loaded  with 
configuration  settings  by  an  external  control  processor  via  a  "skeleton" 
that  is  transparent  to  this  discussion.  This  activity  is  usually  performed 
in  parallel  with  the  controller's  loading  of  the  PE  programs. 

Figure 3. Two configurations of the lattice in Figure 2(a).

A parallel program is viewed as the composition of several parallel
algorithms each with its own processor interconnection pattern. Each of
these interconnection patterns and the associated PE code is called a
"phase." The controller loads the PE's and switches with the instructions
for several phases. Processing begins with a broadcast command from
the controller to the switches to invoke a particular stored
interconnection pattern. This also causes the PE's to begin synchronously
executing their local programs. The interconnection structure remains
static throughout the execution of the phase. When the phase completes,
another broadcast command causes a different interconnection pattern
to be invoked and a new phase to be initiated. The action continues in
this manner from phase to phase.

Several points are worthy of special emphasis. First, to implement
an interconnection pattern requires that all configuration settings be
stored in the same location in all of the switches. This is so that the
broadcast command can take the simple form "invoke the setting in
location ...," thus making possible one step phase transitions. Second,
switches can provide the ability for data paths to "crossover" one
another, i.e., a setting can implement multiple data path
interconnections. Third, the PE's need not know to whom they are
connected; they simply execute instructions of the form READ EAST,
WRITE NORTHWEST, etc. The interconnection pattern explicitly implements
the routing. Fourth, the data paths are bidirectional.
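
The phase mechanism described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the CHiP hardware design: the class names Switch and Controller, the number of memory slots, and the representation of a setting as a tuple of two-letter port pairs are all hypothetical. The sketch shows why storing each pattern at the same location in every switch lets a single broadcast change the whole interconnection structure in one step.

```python
class Switch:
    """A lattice switch with a small local memory of configuration settings."""

    def __init__(self, slots=4):
        # Each slot holds one setting: a tuple of port pairs such as ("EW", "NS").
        self.memory = [()] * slots
        self.active = ()  # the setting currently in force (unconfigured at start)

    def load(self, location, setting):
        self.memory[location] = tuple(setting)

    def invoke(self, location):
        # Circuit switching: the stored connections stay fixed for the phase.
        self.active = self.memory[location]


class Controller:
    """External control processor that loads settings and broadcasts phases."""

    def __init__(self, switches):
        self.switches = switches  # maps lattice position (i, j) -> Switch

    def load_phase(self, location, pattern):
        # Every switch stores its part of the pattern at the SAME location;
        # switches the pattern does not mention are left unconnected.
        for pos, switch in self.switches.items():
            switch.load(location, pattern.get(pos, ()))

    def broadcast(self, location):
        # One command of the form "invoke the setting in this location".
        for switch in self.switches.values():
            switch.invoke(location)
```

A controller would call load_phase once per phase before execution starts and then call broadcast at each phase boundary; the PE programs themselves never see the switch positions.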

Example: Consider the problem of finding the solution to a system of
linear equations, Ax = b, where A is an nxn band matrix of width p and
b is an n vector. To solve the problem we use the Kung-Leiserson LU
decomposition pipelined (systolic) algorithm [4] and their lower
triangular system (LTS) solver algorithm. The interconnection pattern
(for p=4) is shown in Figure 4. The exact operation of the algorithms
is unimportant except to say that they are pipelined and the data
moves in the direction of the arrows. Phase 1 decomposes A into
lower and upper triangular matrices, A = LU, and at the same time
solves the lower triangular system, Ly = b. Figure 5 shows the
embedding into the lattice of Figure 2(a) of these two algorithms; the L
matrix is transferred directly from the decomposition processor to
the LTS solver. The x vector result can be formed by solving Ux = y,
which is done by rewriting U as a lower triangular matrix and using
the LTS algorithm, but U must be completely generated before being
rewritten. Thus, phase 1 saves the U matrix and y vector values in
preparation for phase 2 by threading them through the lattice. (See
Figure 6.) In phase 2 the values are threaded back through the
lattice in the opposite direction, which effects the rewriting operation,
and they are input to another LTS solver. (See Figure 7.) The result
exits from the array at the left end of the LTS solver.

Figure 5. The embedding of the LU-decomposition processors (1-16)
and the lower triangular system solver (A-D) of Figure 4 in
the lattice of Figure 2(a). The embedding appears in the
Northwest corner of the lattice.

Figure 6. Threading the U matrix and y vector values of Phase 1.
Switches not shown.

Figure 7. Reverse threading of the U and y values for Phase 2
together with a second LTS solver.


The example is specialized to a band matrix of width p=4. A general
procedure that solves this problem for arbitrary width bands would differ
only in the interconnection structure; the various PE programs required
for an arbitrary width solution are all represented in this p=4 case. Thus,
it is the programming of interconnection patterns that is of central
importance.


Programming  Interconnection  Patterns 

We  will  emphasize  the  specification  of  uniform  rather  than  ad  hoc 
interconnection  patterns  because  they  are  of  interest  in  their  own  right 
and  they  are  often  the  building  blocks  that  are  used  by  the  less  regular 
patterns. First, we must consider the lattice that is to host the
interconnection pattern.

As indicated in Figure 2, a variety of different lattices are possible,
although any particular architecture will use only one. Lattices differ in
complexity in several ways: corridor width, degree, and crossover
capability. The corridor width, w, is the number of switches separating two
adjacent PE's, e.g., the lattice of Figure 2(a) has w=1 and that of Figure
2(b) has w=2. Any lattice can embed an arbitrary graph, but to do so
may require leaving some PE's unused [9]. A wider corridor width uses
PE's more efficiently when embedding complex graphs. The degree, d, of
a lattice is the number of data paths incident on a PE or a switch. (If
these two numbers are different, d is the minimum.) For example,
Figure 2(a) has d=8 while Figure 2(b) has d=4. Finally, the amount of
crossover capability c is the number of distinct data paths that can
intersect at one switch. A crossover capability c=2 permits a crossover
while c=1 does not. In the interest of generality, we will assume the
"simplest" lattice suitable for an interconnection pattern.


Programming an interconnection pattern requires that the
configuration setting of each relevant switch be defined. For the present
discussion it suffices that we give a logical specification of the setting
since the actual bit configurations are irrelevant. Accordingly, we will
name the eight directions by single letters,

N(orth)
S(outh)
E(ast)
W(est)
M(aine), i.e., Northeast
F(lorida), i.e., Southeast
A(rizona), i.e., Southwest
O(regon), i.e., Northwest

and we will assign settings as pairs of these letters. For example, EW is a
horizontal connection while ME is a 45° angle. The lattice will always be
nxn where n is the number of processors on a side. We name the
switches and PE's with a two value index corresponding to their matrix
position. See Figure 8. We will name the lattice "L".


Figure  8.  The  two  index  coding  scheme  for  a  lattice. 

As an example of this specification method, we observe that the mesh
interconnection pattern (Figure 3(a)) can be defined* by the two
conditions:

(i) i is odd and j is even implies L[i,j] = NS

(ii) i is even and j is odd implies L[i,j] = EW

provided that the lattice is initially unconfigured. A hexagonally
connected interconnection pattern requires the further condition

(iii) i is odd and j is odd implies L[i,j] = OF

and requires a lattice of degree d=6 or (for symmetry) d=8. Notice
that this specification is somewhat more general than that used in
Figure 5.

*In our presentation of interconnection patterns, we will use a simple declarative
specification. We are presently developing a configuration programming language,
but until it is completed, we prefer the neutral declarative approach.
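
The declarative mesh and hexagonal conditions translate almost directly into code. The following Python sketch is an illustration only: the function name mesh_settings is hypothetical, and the index range 1..2n+1 for a w=1 lattice is an assumption inferred from the bounds used in the torus conditions later in the paper. PE's sit at positions where both indices are even and therefore receive no setting.

```python
def mesh_settings(n, hexagonal=False):
    """Map each switch position (i, j) to its two-letter setting.

    Assumes a corridor-width-1 lattice indexed 1..2n+1 in both directions,
    with PE's at (even, even) positions.
    """
    size = 2 * n + 1
    lattice = {}
    for i in range(1, size + 1):
        for j in range(1, size + 1):
            if i % 2 == 1 and j % 2 == 0:
                lattice[(i, j)] = "NS"   # condition (i): vertical link
            elif i % 2 == 0 and j % 2 == 1:
                lattice[(i, j)] = "EW"   # condition (ii): horizontal link
            elif hexagonal and i % 2 == 1 and j % 2 == 1:
                lattice[(i, j)] = "OF"   # condition (iii): Northwest-Southeast diagonal
    return lattice
```

Calling mesh_settings(n) gives the plain mesh; passing hexagonal=True adds the diagonal links of condition (iii), which is why the hexagonal pattern needs the higher-degree lattice.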

Torus  Interconnection  Patterns 

Since the nxn torus interconnection pattern is simply an nxn
mesh with the top row and bottom row PE's connected and the left
column and right column PE's connected (see Figure 1), one might
expect a one corridor, degree 4, crossover capable (c=2) lattice to
suffice to host this pattern. Surprisingly, it does not.

Theorem. Let L be a w=1, d=4, c=2 nxn lattice. L cannot be set
to connect the PE's into an nxn torus.

The proof involves arguing that the perimeter corridors must be used
for two purposes, to support both the vertical and the horizontal "wrap
around," and thus cannot lead to an edge disjoint graph embedding.

Direct Torus Representation. Even when d=8, embedding the torus
is not trivial if we are to avoid multiple use of data paths.

Lattice. w=1, d=8, c=2.

Settings for Crossover Level 1.

First we connect the PE's in the rows. Then we run a data path from
the Northeast port of the first PE through the corridor above the row
and finally down into the Northeast port of the last PE in the row.
The construction is given by conditions (i) through (iii).

(i) [PE row connections] 1<i,j<2n+1 and i is even and j is odd
imply L[i,j] = EW.

(ii) [Northeast ports] i<2n+1 and i is odd imply L[i,3] = AE and
L[i,2n+1] = AW.

(iii) [Corridor above rows] i<2n+1 and i is odd and 3<j<2n+1
imply L[i,j] = EW.

Settings for Crossover Level 2. A similar strategy is used for the
columns.

(iv) [PE column connections] 1<i,j<2n+1 and i is odd and j is
even imply L[i,j] = NS.

(v) [Southwest ports] j<2n+1 and j is odd imply L[3,j] = MS and
L[2n+1,j] = NM.

(vi) [Corridor left of columns] j<2n+1 and j is odd and 3<i<2n+1
imply L[i,j] = NS.

Figure 9 illustrates the entire construction.
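
As a rough check on the direct construction, the six conditions can be generated mechanically. The Python sketch below is an assumption-laden illustration: the function names are hypothetical, and the settings for the end switches (AE, AW, MS, NM) follow our reading of the scanned conditions (ii) and (v). It returns one dictionary per crossover level and counts the switches on a row's wraparound path, which grows with n.

```python
def direct_torus_settings(n):
    """Return (level1, level2): dicts mapping (i, j) -> two-letter setting."""
    if n < 2:
        raise ValueError("this sketch assumes n >= 2")
    hi = 2 * n + 1
    level1, level2 = {}, {}
    # Conditions (i) and (iv): links between neighbouring PE's.
    for i in range(2, hi):
        for j in range(2, hi):
            if i % 2 == 0 and j % 2 == 1:
                level1[(i, j)] = "EW"
            elif i % 2 == 1 and j % 2 == 0:
                level2[(i, j)] = "NS"
    # Conditions (ii), (iii), (v), (vi): the wraparound corridors.
    for k in range(1, hi, 2):          # odd corridor index below 2n+1
        level1[(k, 3)] = "AE"          # leave the first PE's Northeast port
        level1[(k, hi)] = "AW"         # enter the last PE's Northeast port
        level2[(3, k)] = "MS"          # leave the first PE's Southwest port
        level2[(hi, k)] = "NM"         # enter the last PE's Southwest port
        for t in range(4, hi):         # strictly between the two end switches
            level1[(k, t)] = "EW"
            level2[(t, k)] = "NS"
    return level1, level2


def wraparound_switch_count(level1, corridor_row):
    """Number of switches a row's wraparound path passes through."""
    return sum(1 for (i, _j) in level1 if i == corridor_row)
```

Under these assumptions the wraparound path of each row crosses 2n-1 switches, which is the long data path the text goes on to criticize.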

The  difficulty  with  this  interconnection  pattern,  of  course,  is  that 
it  has  long  data  paths  that  are  subject  to  propagation  delay.  Some 
algorithms  can  accept  such  a  delay,  but  generally  we  would  like  to 
reduce it. Accordingly, we prefer the following more intricate
pattern that interleaves the row and column processing elements so
that there is a fixed bound on the distance a signal must travel.


Figure 9. Direct embedding of the torus into
the lattice of Figure 2(a). Edges
of like color intersecting at a
switch are connected.


Figure 10. Interleaved embedding of the torus
into the lattice of Figure 2(a).

Interleaved Torus Representation

Lattice. w=1, d=8, c=2.

Settings for Crossover Level 1.

First we connect alternate PE's in rows. The end connections are
specified by

(i) [East port, end PE's] i is even implies L[i,3] = EW and
L[i,2n+1] = WO.

The westerly port connections of each PE are given by

(ii) [West port] i is even and 3<j<2n+1 and j is odd imply
L[i,j] = OE.

The connections in the corridor above the row are given by

(iii) [Northeast port] i<2n+1 and i is odd and 3≤j<2n+1 and j is
odd imply L[i,j] = AE.

(iv) i<2n+1 and i is odd and 3<j and j is even imply L[i,j] = WF.

Settings for Crossover Level 2. The columns are connected in a
manner analogous to the rows.

(i) [South port, end PE's] j is even implies L[3,j] = NS and
L[2n+1,j] = ON.

(ii) [North port] j is even and 3<i<2n+1 and i is odd imply
L[i,j] = OS.

(iii) [Southwest port] j<2n+1 and j is odd and 3≤i<2n+1 and i is
odd imply L[i,j] = SM.

(iv) j<2n+1 and j is odd and 3<i and i is even imply L[i,j] = NF.

The entire construction is shown in Figure 10.

Clearly the maximum number of switches that any data item must
pass through is three. We have increased the locality of the torus
embedding. It is, therefore, more amenable to VLSI implementation
and can be used in an arbitrarily large lattice with only a constant
delay.
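
The three-switch bound can be checked by generating the Level 1 (row) settings and following a signal from a PE port until it reaches another PE. The Python sketch below is a hypothetical illustration: the function names, the index range 1..2n+1, the geometry (North means a smaller row index) and the reading of conditions (i) through (iv) are our assumptions about the scanned text. Only the row settings are modelled; the columns are analogous.

```python
# Direction letters from the paper, as (row step, column step).
STEP = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1),
        "M": (-1, 1), "F": (1, 1), "A": (1, -1), "O": (-1, -1)}
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E",
            "M": "A", "A": "M", "F": "O", "O": "F"}


def interleaved_row_settings(n):
    """Level 1 settings of the interleaved torus for an n x n lattice."""
    hi = 2 * n + 1
    lattice = {}
    for i in range(2, hi, 2):              # PE rows (i even)
        lattice[(i, 3)] = "EW"             # (i) first PE to second PE
        lattice[(i, hi)] = "WO"            # (i) last PE's East port
        for j in range(5, hi, 2):          # (ii) 3 < j < 2n+1, j odd
            lattice[(i, j)] = "OE"
    for i in range(1, hi, 2):              # corridor rows (i odd, i < 2n+1)
        for j in range(3, hi, 2):          # (iii) 3 <= j < 2n+1, j odd
            lattice[(i, j)] = "AE"
        for j in range(4, hi, 2):          # (iv) 3 < j, j even
            lattice[(i, j)] = "WF"
    return lattice


def trace(lattice, pe, port):
    """Follow a data path from a PE port; return (destination PE, switches crossed)."""
    (r, c), direction, crossed = pe, port, 0
    while True:
        dr, dc = STEP[direction]
        r, c = r + dr, c + dc
        if r % 2 == 0 and c % 2 == 0:      # reached a PE
            return (r, c), crossed
        setting = lattice.get((r, c))
        entry = OPPOSITE[direction]        # the switch port we arrive on
        if setting is None or entry not in setting:
            return None                    # path is not connected
        direction = setting[0] if setting[1] == entry else setting[1]
        crossed += 1
```

Tracing the East port of the first PE in a row and the Northeast port of every other PE but the last reproduces the interleaved ring, and in this model no path crosses more than three switches regardless of n.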


Complete  Binary  Trees 

Although an efficient embedding of complete binary trees into
the plane is known [10], its direct application to interconnection
pattern programming is very wasteful. (See Figure 11.) In fact, since a
complete binary tree of depth m has 2^m - 1 nodes, we can expect a
lattice with 2^k x 2^k PE's to host a complete binary tree of depth 2k
with one unused node. Call this node a "spare." We can expect that
the simplest lattice hosting this pattern will not require crossover
capability, since trees are planar, and will require only degree d=4,
since trees have at most degree 3 connections. (The lattice then is
given by w=1, d=4, c=1.) But if the reader attempts to develop an
interconnection with these conditions, he will find it to be
unexpectedly difficult.

Figure 11. Hyper-H tree (Figure 1(d)) embedding [10]. Filled PE's
are unused.

The overall strategy is to begin with small, complete binary trees
embedded in square regions of the lattice. To reduce propagation delay
the root will be placed in the center of the block. Each block will contain
a spare PE. We compose four such square blocks together to form a
larger binary tree in a larger square block. Three of the four spare PE's
will be used as nodes in the composed tree; the fourth spare will become
the spare of the new block. The goal is to place the spares so that they
will be conveniently located for the composition.
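
The node counting behind this composition can be verified with a short calculation. The Python sketch below is illustrative only; the function names are hypothetical. It uses the paper's convention that a complete binary tree of depth m has 2^m - 1 nodes and checks that four blocks of 2^(k-1) x 2^(k-1) PE's, plus three of their four spares, form exactly a tree of depth 2k with one spare left over.

```python
def tree_nodes(depth):
    """Nodes in a complete binary tree of the given depth (paper's convention)."""
    return 2 ** depth - 1


def compose_counts(k):
    """Node and spare counts when four sub-blocks form a 2^k x 2^k block."""
    sub_pes = 4 ** (k - 1)                     # PE's in one sub-block
    sub_nodes = tree_nodes(2 * (k - 1))        # tree nodes in one sub-block
    spares_available = 4 * (sub_pes - sub_nodes)
    composed_nodes = 4 * sub_nodes + 3         # three spares become tree nodes
    return {
        "spares_available": spares_available,
        "composed_nodes": composed_nodes,
        "target_nodes": tree_nodes(2 * k),
        "spares_left": 4 * sub_pes - composed_nodes,
    }
```

The three promoted spares act as the new root and its two children, which is why the depth grows by two at each level of the recursion.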

Define  three  types  of  tree  embeddings: 

Type  A  blocks  have  their  spare  PE  midway  along  one  side  adjacent  to 
the  exiting  edge  from  the  block’s  root. 

Type  B  blocks  have  their  spare  PE  in  the  corner  on  the  same  side  as 
the  exiting  edge  from  the  block's  root. 

Type  C  blocks  have  their  spare  PE  in  the  corner  on  the  opposite  side 
of  the  exiting  edge  from  the  root. 

Figure 12 illustrates the three types of blocks and demonstrates that
they can be inductively produced using blocks of these types.


Figure 12. Schematic of blocks (Type A, Type B, Type C) composed to
form larger blocks. Solid squares represent original roots; open
squares represent spares. Superscript "F" means reflect with
respect to horizontal axis (flip); superscript "R" means reflect
with respect to vertical axis (reverse).

Notice that, as part of the inductive hypothesis, we must argue that
the perimeter switches are available for routing the new edges. This is
obviously true if they are available in the basis blocks. The smallest
blocks that we have been able to find with this property are 4x4 blocks
embedding 15 node binary trees. These are illustrated in Figure 13.

Figure 13. Basis blocks for planar binary tree embedding.

The conceptual algorithm is clear. Refer to Figure 14. Begin with an
objective block type, e.g., Type B, and a lattice of size 2^k x 2^k PE's.
Recursively embed the four subtrees in lattices of size 2^(k-1) x 2^(k-1)
such that the proper block types are selected. In the basis cases
(2^2 x 2^2), use an explicit embedding. Notice that the results may
require reflection. Connect the three spares by appropriate switch
settings. This latter operation is always possible based on an inductive
argument that depends upon two facts:

(a) After the basis connection, all spares have their origin as Type C
basis block elements, and

(b) None of the switches surrounding a Type C basis block spare is
used and so there are three directions of access.

This guarantees that the three data paths can always be assigned. The
detailed program is omitted.

Clearly, we have achieved our goal of complete PE usage of this
simple lattice. If the available lattice were more complex, e.g., had
degree 8 or multiple corridors, then the same embedding would work and
some minor optimizations would be possible.

Lacing a Corridor

Although we could present many more of our embeddings (a broadcast
tree, a double tree, leaves on a line tree, shuffle exchange, etc.), it is
perhaps more instructive to illustrate a technique that gives unexpected
power for programming complex graphs. It is called "lacing a corridor"
and it takes optimum advantage of a fixed architectural resource, the
corridor width.

Suppose one is embedding an interconnection pattern and must
move a large number of distinct data paths across a region of the lattice.
By definition, the corridor width, w, is the number of switches separating
adjacent PE's. Thus, if the degree d=4, then w distinct data paths can be
routed between a pair of PE's. It would appear that for the degree d=8
lattice, w distinct data paths are still the maximum that can be routed
down a corridor. But we can do much better.

The idea behind lacing is to begin with straight data paths down a
corridor and then to add zig-zag paths that exploit the higher degree and
the crossover capability of the switches. For example, Figure 15
shows a w=4, d=8, c=3 lattice in which ten distinct data paths have been
squeezed through the four available switches! This is the maximum
possible since the bisection width of this portion of the lattice is ten.
(Bisection width is a concept introduced by Thompson [11] referring to the
minimum number of wires cut by a line bisecting a VLSI layout.) If we
expand our scope somewhat and include the switches that bound the
corridor, then we can increase the number of distinct paths by two. (We will
ignore this optimization in the lacing definition below.)

Figure 15. Lacing ten distinct data paths through four switches.

Lattice. w>1, d=8, c=3.

The construction is limited to a region bounded by four PE's. The upper
left hand corner PE is L[r,s].

Settings for crossover level 1. [Horizontal Path]

(i) 1≤i≤w and 0≤j≤w+1 imply L[r+i,s+j] = EW.

Settings for crossover level 2. [Dotted Path]

(ii) 1≤i≤w-1 and 0≤j≤w+1 and j is even imply L[r+i,s+j] = AF.

(iii) 1≤i≤w-1 and 0≤j≤w+1 and j is odd imply L[r+i+1,s+j] = OM.

Settings for crossover level 3. [Dashed Path]

(iv) 1≤i≤w-1 and 0≤j≤w+1 and j is even imply L[r+i+1,s+j] = OM.

(v) 1≤i≤w-1 and 0≤j≤w+1 and j is odd imply L[r+i,s+j] = AF.

Notice  that  if  the  switches  had  even  higher  crossover  capability  c=4, 
which  is  the  maximum  for  degree  8  switches,  then  we  could  even  route 
vertical  wires  across  the  laces  if  they  were  needed. 
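
The lacing conditions can be exercised with a small generator that also confirms the path count quoted for Figure 15. This Python sketch is a hypothetical illustration: the function names are ours, and the bounds and settings follow our reading of the scanned conditions (i) through (v). It builds one dictionary per crossover level, counts w straight paths plus w-1 dotted and w-1 dashed zig-zag paths, and checks that no switch uses the same port on two levels.

```python
def lace_corridor(w, r=0, s=0):
    """Return {level: {(row, col): setting}} for a laced corridor of width w."""
    levels = {1: {}, 2: {}, 3: {}}
    columns = range(0, w + 2)                  # offsets j = 0 .. w+1
    for i in range(1, w + 1):                  # (i) straight horizontal paths
        for j in columns:
            levels[1][(r + i, s + j)] = "EW"
    for i in range(1, w):                      # zig-zag paths between rows i, i+1
        for j in columns:
            if j % 2 == 0:
                levels[2][(r + i, s + j)] = "AF"       # (ii) dotted peak
                levels[3][(r + i + 1, s + j)] = "OM"   # (iv) dashed valley
            else:
                levels[2][(r + i + 1, s + j)] = "OM"   # (iii) dotted valley
                levels[3][(r + i, s + j)] = "AF"       # (v) dashed peak
    return levels


def path_count(w):
    """Distinct data paths through the corridor: w straight plus 2(w-1) laced."""
    return w + 2 * (w - 1)


def ports_are_disjoint(levels):
    """True if no switch uses the same port in two crossover levels."""
    used = {}
    for settings in levels.values():
        for position, pair in settings.items():
            seen = used.setdefault(position, set())
            if seen & set(pair):
                return False
            seen.update(pair)
    return True
```

For w=4 the count is ten, matching Figure 15, and every interior switch carries three settings on six distinct ports, which is why c=3 and d=8 are both needed.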

Conclusions 

We have introduced the CHiP architecture and argued that its
provision for interconnection pattern programming alleviates many of the
difficulties encountered in parallel program development. This
simplification is achieved in two ways. First, the rigidity of a fixed
interconnection structure is no longer an obstacle when one wants to
program an algorithm that uses a different interconnection pattern. And
secondly, there is a clean separation between routing the data and
programming the activity of the PE's.

Additionally we have demonstrated that interconnection programming
is an interesting and challenging activity. We have shown that locality
can be increased by careful study of the torus. We have shown that it is
possible to embed the complete binary tree to achieve essentially
complete PE utilization. The result involves an interesting assignment of
spare PE's. And we have shown that there are general techniques (e.g.,
corridor lacing) to be found.